CS7643 Deep Learning - Module 1 (Intro to Neural Networks)

Lesson 1: Linear Classifiers and Gradient Descent

Components of a Parametric Learning Algorithm

Multiclass linear classifier:

  • You can interpret this as 3 independent linear classifiers
  • Note that you can fold the bias term into the weight matrix by appending a "1" to the end of the input. This reduces the computation to a single matrix-vector multiplication. Why? Efficiency.
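The bias trick can be sketched in a few lines of NumPy (the sizes and values here are made up for illustration):

```python
import numpy as np

# Made-up sizes: 3 classes, 4-dimensional input.
W = np.arange(12, dtype=float).reshape(3, 4)  # weights
b = np.array([1.0, 2.0, 3.0])                 # biases
x = np.array([0.5, -1.0, 2.0, 0.0])           # input

# Separate weight/bias computation:
scores_separate = W @ x + b

# Fold the bias into the weight matrix and append a "1" to the input:
W_aug = np.hstack([W, b[:, None]])   # shape (3, 5)
x_aug = np.append(x, 1.0)            # shape (5,)
scores_folded = W_aug @ x_aug        # one matrix-vector multiplication

assert np.allclose(scores_separate, scores_folded)
```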

Interpreting a Linear Classifier

How we can interpret:

  • Algebraic: We can view classification as a linear algebra computation. Given a matrix of pixel values, each pixel is tied to a weight (plus a bias term).
  • Visual: We can reshape the weight vector back into the shape of the image and visualize it. After optimization, the image looks like an average over all inputs of that class.
  • Geometric: We can view the linear classifier as a linear separator in a high-dimensional space.

Limitations:

  • Not all classes are linearly separable.
  • XOR and bullseye functions are not linearly separable

Performance Measure for a Classifier

Converting Scores to Probabilities

Use softmax function to convert scores to probabilities:

$$ s = f(x,W) \\ P(Y=k|X=x)=\frac{e^{s_k}}{\sum_j e^{s_j}} $$

Steps:

  1. Given score vector $s$ (score for each class)
  2. Feed this vector through softmax function (2nd equation, right)
  3. For each class $k$, exponentiate the score for class $k$ and divide it by the sum of the exponentiated scores over all classes.
  4. This normalizes all of the scores to lie between zero and one (and sum to one), which fits the definition of a probability.
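The steps above can be sketched in a few lines (scores are made up; the max is subtracted before exponentiating for numerical stability, which doesn't change the result):

```python
import numpy as np

def softmax(s):
    # Subtract the max score for numerical stability (result is unchanged).
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # made-up class scores
probs = softmax(scores)

assert np.isclose(probs.sum(), 1.0)        # probabilities sum to one
assert np.all((probs > 0) & (probs < 1))   # each lies strictly in (0, 1)
```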

Performance Measure

Example of multiclass SVM loss:

Takeaways:

  • Loss is zero if the score for $y_i$ (the ground truth label) is greater than or equal to the scores of all the other (incorrect) classes, plus 1. (First equation)
    • Goal is to maintain a margin between the score for the ground truth label and all the other possible categories.
  • If not, penalize it by how different it is from this margin.
    • Take the max over all classes that are not the ground truth and compute the loss (second equation)
    • Take its sum over all classes that aren't the ground truth and penalize the model whenever the score for the ground truth itself is not bigger by a particular margin.
  • This type of loss is called a hinge loss.

Example of how this is calculated with image class prediction:

  • Car score is highest (wrong)
  • Take max of either 0 or wrong score - grouth truth score + 1
  • In this case, car score will incur loss but frog will not (since it's lower than cat).
  • Final loss is the sum of each diff (in this case, 2.9)
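The calculation can be reproduced with scores chosen to match the described outcome (the classic example numbers: cat = 3.2, car = 5.1, frog = -1.7, with cat as ground truth):

```python
import numpy as np

# Assumed example scores; "cat" is the ground-truth class.
scores = {"cat": 3.2, "car": 5.1, "frog": -1.7}
correct = scores["cat"]

# Multiclass SVM (hinge) loss: sum over incorrect classes of
# max(0, s_j - s_{ground truth} + 1).
loss = sum(max(0.0, s - correct + 1.0)
           for label, s in scores.items() if label != "cat")

# car contributes max(0, 5.1 - 3.2 + 1) = 2.9; frog contributes 0.
assert np.isclose(loss, 2.9)
```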

Loss for classification

Cross-entropy and MLE:

  • For multiclass loss, we use the negative log probability of the softmax output:

Example using the image classification task:

Takeaways:

  1. Given raw scores, we exponentiate them (orange box)
  2. Then normalize to get probabilities (green box)
  3. Take the negative log of the probability assigned to the one that matches ground truth (e.g. "cat")
  4. We don't take other probabilities into account for the loss. Why?
    1. Probabilities inherently induce competition
    2. Optimization algorithm can both boost weights for cat and decrease weight for other classes.
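A sketch of the cross-entropy computation with made-up scores (index 0 plays the role of "cat", the ground truth; only its probability enters the loss):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

# Made-up scores; index 0 is the ground-truth class.
scores = np.array([3.2, 5.1, -1.7])
probs = softmax(scores)

# Cross-entropy: negative log probability of the ground-truth class only.
loss = -np.log(probs[0])

assert np.isclose(probs.sum(), 1.0)
assert loss > 0
```

Because the probabilities are coupled through the normalizing sum, pushing the ground-truth probability up necessarily pushes the others down.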

Loss for regression / probability

Regularization term

Regularization encodes a preference for simpler models over complex ones. A regularization term is added to the loss function:

L1 regularization adds the L1 norm of the weight vector: $$ L_i = |y_i-Wx_i|^2+\lambda|W| $$

Linear Algebra View: Vector and Matrix Sizes

Size of weights and inputs

  • c = number of classes
  • d = dimensionality of input
  • $W$: Size = $[c\times(d+1)]$
  • $x$: Size = $[(d+1)\times 1]$
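A quick shape check of the sizes above, with made-up values for $c$ and $d$:

```python
import numpy as np

c, d = 3, 4                 # made-up: 3 classes, 4 input features
W = np.zeros((c, d + 1))    # bias folded in: size [c x (d+1)]
x = np.ones(d + 1)          # input with appended "1": size [(d+1) x 1]
scores = W @ x              # result: one score per class, size [c x 1]

assert W.shape == (c, d + 1)
assert scores.shape == (c,)
```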

Dimensionality of derivatives - Conventions

Given a scalar $s \in \mathbb{R}^1$ and vector $v \in \mathbb{R}^m$:

  • Size of the partial derivative of $v$ wrt $s$: $\mathbb{R}^{m\times 1}$
    • ie. a column vector of size $m$
  • Size of the partial derivative of $s$ wrt $v$: $\mathbb{R}^{1\times m}$
    • ie. a row vector of size $m$

Q: Given 2 vectors, what's the size of $\frac{\partial v^1}{\partial v^2}$?

  • Answer: A Jacobian matrix. For $v^1 \in \mathbb{R}^m$ and $v^2 \in \mathbb{R}^n$, it has size $m\times n$ and contains the partial derivative of each element of $v^1$ wrt each element of $v^2$.

Q: Given a scalar and a matrix, what's the size of $\frac{\partial s}{\partial M}$?

  • Answer: Matrix containing derivative of the scalar wrt each element in the matrix

Q: What is the size of $\frac{\partial L}{\partial W}$?

  • Answer: The derivative of the loss (scalar) wrt the weights (matrix) is a matrix with the same shape as $W$, containing the derivative of the loss wrt each element of the weight matrix.

Jacobians of Batches

Takeaway:

  • Batch size affects the dimensionality of our data
  • Tensors: multidimensional matrices
  • Can be unwieldy - instead, flatten input to vector and get a vector of derivatives.

How is Deep Learning Different?

What makes it different:

  1. Representation learning - takes in raw data rather than processed form (ie. histogram)
    • Conducts feature extraction - automatically pulling features from raw data
  2. Uses neural networks
  3. Can tackle various ML tasks (unsupervised, RL, etc)

What is deep learning:

Features: Trad vs Deep Learning

Features are engineered in traditional ML:

Features are automatically extracted in deep learning:

  • Key is hierarchical compositionality, meaning all data has some hierarchical order which neural networks can represent.

Example of features for image detection:

Building a complicated function

Representation of data is done through building up a network of simple functions into a complex network.

  • This idea is similar to boosting where we employ an ensemble of weak learners.
  • Significance: We can use any differentiable function to construct the network (sin, quadratic, log, etc).

End-to-End Learning

"End-to-end": Learning is applied to entire spectrum, from raw data -> feature extraction -> classification.

  • No handcrafted feature extraction; features are auto-extracted. The distinction between extracted features and the classifier becomes blurry.
  • Key Idea: Learn very separable feature representation that can be easily classified.

Gradient Descent

Derivatives

Algorithm:

Applying batch gradient descent:

Convergence notes:

Computing Gradients

How to compute $\frac{\partial L}{\partial W_i}$?

  1. Manual differentiation
    • Labor intensive
    • Can't compute closed form solution for complex functions
  2. Symbolic differentiation
    • Similar to manual
  3. Numerical differentiation
    • Works for any function
    • Computationally expensive
  4. Automatic differentiation
    • Used by most DL libraries
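As a concrete illustration of numerical differentiation (and why it's expensive: it needs two function evaluations per parameter), here is a central-difference sketch:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference estimate of df/dx for each element of x.
    Cost: 2 evaluations of f per parameter - expensive for large models."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x[i]
        x[i] = orig + eps
        f_plus = f(x)
        x[i] = orig - eps
        f_minus = f(x)
        x[i] = orig                      # restore the original value
        grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad

f = lambda x: (x ** 2).sum()             # analytic gradient is 2x
x = np.array([1.0, -2.0, 3.0])
g = numerical_gradient(f, x)

assert np.allclose(g, 2 * x, atol=1e-4)  # matches the analytic gradient
```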

Manual Differentiation

Derivation of update rule using squared loss:

  • Note on 2nd to last summation on update rule:

    The partial derivative of this summation with respect to $w_j$ causes most of the terms to vanish: when $i$ is not equal to $j$, those terms do not involve $w_j$, so their derivatives are zero.

(Some more context on getting the partial derivative of update rule above)

Update rule once we add non-linearity (sigmoid) - Gets more complex:

Decomposing a Function

Manual differentiation can get messy. We can decompose the complicated function into modular sub-blocks.

Distributed Representations

Key ideas:

  • No single neuron "encodes" everything
  • Groups of neurons work together as a distributed representation

Distributed representation: Toy example

  • One-hot labels of shapes (left)
  • By distributing characteristics (right), can efficiently cover a wide range of shapes by combining them.

Lesson 2: Neural Networks

Neural Network View of a Linear Classifier

(Think of this as another view in combination with the linear algebra view)

Origins of the term

Output can be modulated by a non-linear function (e.g. sigmoid)

Connecting Many Neurons

Terms:

  • Each input/output is a neuron (node)
  • A linear classifier is called a fully connected layer
  • Connections represented as edges
  • Activation: Output of a particular neuron
  • This will be expanded as we view computation in a NN as a graph

The magic of NN is that we can stack multiple layers together:

From a linear algebra view, a 2-layer NN corresponds to adding another weight matrix:

We can build deeper networks by adding more layers.

  • A 2-layer network can approximate any continuous function
  • A 3-layer network can theoretically approximate any function
  • However, the number of nodes could grow unreasonably (exponentially or worse) wrt the complexity of the function.
  • However, we still need to figure out the initial architecture (how many layers, how many nodes). This is not learned.
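A minimal sketch of a 2-layer network in NumPy (sizes and values are made up). Note that without the non-linearity between the layers, the composition would collapse back to a single linear map $W_2 W_1$:

```python
import numpy as np

# Made-up sizes: input, hidden, and class dimensions.
rng = np.random.default_rng(0)
d, h, c = 4, 8, 3
W1 = rng.standard_normal((h, d))   # first layer weights
W2 = rng.standard_normal((c, h))   # second layer weights
x = rng.standard_normal(d)

hidden = np.maximum(0, W1 @ x)     # non-linearity (ReLU) between layers
scores = W2 @ hidden

assert hidden.shape == (h,) and scores.shape == (c,)
```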

Computation Graphs

Adding even more layers:

Compositionality

  • The world is compositional - we want our model to reflect this.
  • Empirical and theoretical evidence that having this hierarchy makes learning complex functions easier.
  • Note that prior SOTA hand-engineered features often had this compositionality as well.

Computing Gradients in a Complex Function

Problem:

  • We are learning complex models with a significant number of parameters (millions/billions)
  • How do we compute the gradients of the loss (at the end) with respect to internal parameters?
  • Intuitively, want to understand how small changes in weights deep inside are propagated to affect the loss function at the end.

Answer:

  • View the function as a computation graph
  • Specifically, a DAG (directed acyclic graph)
  • Modules must be differentiable to support gradient computations for gradient descent.
  • A training algorithm will process this graph, one module at a time.

Example representation of a function as a graph:

  • Represents an ordering of computations
  • Significance: Allows us to know how to compute it in reverse (important for backprop)

Backpropagation

Overview of training

Forward pass

Backwards pass

Computing Local Gradients: Example

Use matrix calculus to get derivatives of local gradients:

Computing the gradients of loss:

  • Goal: For a given module, get the partial deriv. of the loss wrt its inputs, or the partial deriv. of the loss wrt to its weights.
  • Problem: Tricky part is that Loss is all the way at the end of the computation graph. How do we compute these terms?
  • Answer: Use the chain rule
    • Chain rule: If you have a particular graph that goes from $x$ to some intermediate variable $y$ to $z$, then $\frac{\partial z}{\partial x}$ is equal to the product of $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$
    • Intuition: If we want to know how $z$ changes when we make a tiny change to $x$, we need to understand how $y$ changes when we make a small change to $x$, then multiply that by how $z$ changes when $y$ changes.
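The chain rule intuition above can be checked on a toy example, $y = 3x$ and $z = y^2$ (values assumed for illustration):

```python
# Chain rule sketch on x -> y -> z with y = 3x and z = y^2.
def y(x): return 3.0 * x
def z(y_val): return y_val ** 2

x0 = 2.0
# Analytic pieces: dz/dy = 2y and dy/dx = 3, so dz/dx = 2 * y(x0) * 3.
dz_dx = 2 * y(x0) * 3.0

# Numerical check via central differences:
eps = 1e-6
numeric = (z(y(x0 + eps)) - z(y(x0 - eps))) / (2 * eps)

assert abs(dz_dx - numeric) < 1e-3   # both give 18 * x0 = 36
```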

Applying Chain Rule

Neural Network Training

Summary:

  1. Forward Pass: Compute loss on mini-batch
  2. Backward Pass: Compute gradients wrt parameters.
    • Each layer receives the upstream gradient (how the loss changes wrt its outputs) and computes the change in loss wrt its inputs and wrt its parameters.
    • Each layer computes gradients wrt its parameters to perform gradient descent.
    • Then computes gradient of the loss wrt to inputs in order to send this back to the previous layer.
    • Now the previous layer knows how the loss changes if its output changes.
    • This continues back to the first layer. Note that the first layer doesn't need to compute the gradient of the loss wrt its inputs, since there are no earlier layers to pass it to.

Key Idea: Backpropagation is the application of gradient descent to a computation graph via the chain rule.

Backpropagation and Automatic Differentiation

Key Idea:

  • Auto-diff is a generalization of back propagation.
  • Reverse-mode auto-diff: Iterates from last module backwards, applying the chain rule to a DAG.
  • Decomposes a function into very simple primitive functions (addition, multiplication) where we already know what the derivatives are.
  • Create framework where we can just define the computation graph using simple primitives so we don't need to worry about the backwards gradients. In other words, we don't need to write code that actually computes the gradients of the function.

Deep Learning = Differentiable Programming

Computation as a Graph:

  1. Input = Data + Parameters
  2. Output = Loss
  3. Scheduling = Topological ordering
    • ie. which computations have to be done first.

Automatic Differentiation: A family of algorithms for implementing chain-rule on computation graphs.

Example Computation Graph:

Partial derivatives from $a_3$ upstream:

Notes:

  • For nodes that lie on more than one path, you need to sum the gradients coming from the multiple paths.

Patterns of Gradient Flow

Different operations have different effects on the gradient.

  • Addition operation distributes gradients along all paths.
  • Multiplication is a gradient switcher, ie. it multiplies the upstream gradient by the value of the other factor.
  • Max operation selects which path to push the gradients through. Gradients will only flow through the path that was selected to be the maximal value. This information of the flow must be recorded in the forward pass.
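These three patterns can be traced through a tiny hand-computed backward pass (toy values assumed) for $f = \max(a, b) \cdot c + d$:

```python
# Manual backward pass through f = max(a, b) * c + d, showing the three
# gradient-flow patterns. Toy values assumed for illustration.
a, b, c, d = 2.0, 5.0, 3.0, 1.0
m = max(a, b)          # forward pass records that b was selected
f = m * c + d

df_df = 1.0
# Addition distributes the upstream gradient along both paths:
d_mc, d_d = df_df, df_df
# Multiplication switches in the value of the other factor:
d_m, d_c = d_mc * c, d_mc * m
# Max routes the gradient only through the selected input:
d_a, d_b = (d_m, 0.0) if a > b else (0.0, d_m)

assert (d_a, d_b, d_c, d_d) == (0.0, 3.0, 5.0, 1.0)
```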

Key Idea: If gradients do not flow backwards properly, learning slows or stops

Computational Implementation

Key Ideas:

  • Explicitly store computation graph in memory and corresponding gradient functions.
  • Nodes broken down to basic primitive computations - corresponding derivative is known.
  • In the past, you performed NN computation by specifying both the forward function (ie. $W^T \cdot X$) and manually computing the backwards function for each layer.
  • In auto-diff paradigm, all we need to do is put together a forward computation graph and all of these gradients will be computed for us.

Forward Mode Automatic Differentiation

Key ideas:

  • Different from reverse mode auto-diff, which starts from the output.
  • Forward mode starts from the input and propagates it forward.
  • Complexity is proportional to input size.
  • However, in most ML tasks, our input sizes are huge while outputs (ie. loss) are small. Thus not common in DL.
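Forward mode is often sketched with dual numbers: each value carries its derivative wrt one chosen input, propagated forward alongside the value. A minimal illustration (not how production systems implement it):

```python
# Minimal dual-number sketch of forward-mode autodiff.
class Dual:
    def __init__(self, val, dot):
        self.val, self.dot = val, dot   # value and derivative wrt one input
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

# f(x, y) = x*x + x*y at (x, y) = (3, 4); seed dx = 1 to get df/dx.
x = Dual(3.0, 1.0)
y = Dual(4.0, 0.0)
f = x * x + x * y

assert f.val == 21.0   # 9 + 12
assert f.dot == 10.0   # df/dx = 2x + y = 10
```

Getting df/dy would require a second forward pass with the seed on `y` instead, which is why the cost scales with the input size.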

Computation Graphs in PyTorch

Key ideas:

  • Modern libraries build out computation graphs on the fly as the code executes.
  • This allows backprop to be executed very simply (backward()). We don't need to define separate forward/backwards functions like in the past.
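A minimal PyTorch sketch of define-by-run autodiff (toy scalar values assumed):

```python
import torch

# PyTorch records the graph on the fly as operations execute.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

loss = (w * x - 1.0) ** 2   # graph is built during this forward pass
loss.backward()             # reverse-mode autodiff over the recorded graph

# d(loss)/dw = 2*(w*x - 1)*x = 20; d(loss)/dx = 2*(w*x - 1)*w = 30
assert w.grad.item() == 20.0
assert x.grad.item() == 30.0
```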

Power of Automatic Differentiation

Power of auto-diff stems from the idea of Differentiable programming:

  1. Not limited to mathematical functions.
  2. Can employ control flows (if, loops) and backpropagate through these algorithms.
  3. Can be done dynamically so that gradients are computed, then nodes added, and you can repeat this.

Computation Graph Example for Logistic Regression

Linear Classifier: Logistic Regression

Components:

  1. Input: $x \in \mathbb{R}^D$
  2. Binary label: $y \in \{-1, +1\}$
  3. Parameters: $w \in \mathbb{R}^D$
  4. Output prediction (e.g. sigmoid): $$ p(y=1|x) = \frac{1}{1+e^{-w^T x}} $$
  5. Loss (e.g. log loss): $$ L = \frac{1}{2}||w||^2 - \lambda \log (p(y|x)) $$

Key Idea: A machine learning pipeline (input -> model -> loss) is also a computation graph

  • This means we can use the computed gradients from backprop/auto-diff to update the weights.
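A NumPy sketch of this computation graph, using the loss above with assumed values for $\lambda$ and the data, and checking the hand-derived backward pass against a numerical derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression as a small graph: x -> w.x -> sigmoid -> log loss.
rng = np.random.default_rng(0)
w = rng.standard_normal(3)   # made-up parameters
x = rng.standard_normal(3)   # made-up input
y = 1.0                      # label in {-1, +1}
lam = 0.1                    # assumed regularization strength

margin = y * (w @ x)
loss = 0.5 * (w @ w) - lam * np.log(sigmoid(margin))

# Backward pass via the chain rule:
# dL/dw = w - lam * (1 - sigmoid(margin)) * y * x
grad = w - lam * (1.0 - sigmoid(margin)) * y * x

# Numerical check of one coordinate:
eps = 1e-6
w_eps = w.copy(); w_eps[0] += eps
loss_eps = 0.5 * (w_eps @ w_eps) - lam * np.log(sigmoid(y * (w_eps @ x)))
assert abs((loss_eps - loss) / eps - grad[0]) < 1e-4
```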

Example Gradient Computations

Key Idea: All we need to do is define the forward function, no backwards.

Vectorization and Jacobians of Simple Layers

Key Idea: Chain rule can be computed via a series of operations on scalars, vectors, and matrices.

Example of logistic regression:

Fully Connected (FC) Layer: Forward Function

Example of input/weights/output dimensions of a fully connected layer:

Sizes of Jacobians (gradients):

  • This video is dense. Rewatch link to video
  • Key Ideas:
    • Not taking the partial derivative wrt the entire $W$ matrix because it would lead to tensors (bad). Note: the partial of a vector wrt a matrix results in a tensor
    • Instead, take the row-wise partial of $h^l_i$ wrt $w_i$. This results in a sparse matrix: each output is only affected by the corresponding weight row, and the other entries are zero.

Rectified Linear Unit (ReLU)

ReLU is a substitute for the sigmoid.

  • Provides non-linearity.
  • Better gradient flow than sigmoid (why?)
  • Performed element-wise (see figure below) - just applies max operation between 0 and input $h^{l-1}$.
  • No parameters (just doing max operation)

(ReLU and other activation functions):

Jacobian of ReLU

Key Ideas:

  • Gradient of inputs less than zero will be zero. Remember that ReLU takes either max or zero.
  • Full Jacobian of ReLU is large: output x input dimensions ($|h^l| \times |h^{l-1}|$).
    • However, Jacobian of ReLU is sparse - Only diagonal values will have non-zero elements.
    • Why? Output value is only affected by corresponding input value, nothing else.
  • Final Jacobian is:
$$ \left(\frac{\partial h^l}{\partial h^{l-1}}\right)_{ii} = \begin{cases}1,& \text{if }h^{l-1}_i \gt 0 \\ 0, & \text{otherwise} \end{cases} $$
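A small NumPy illustration of the sparse, diagonal Jacobian (input values assumed):

```python
import numpy as np

# The ReLU Jacobian is diagonal: output i depends only on input i.
h_prev = np.array([2.0, -1.0, 0.5, -3.0])   # assumed pre-activation values
h = np.maximum(0, h_prev)

# Dense Jacobian, formed here only for illustration:
jacobian = np.diag((h_prev > 0).astype(float))

assert np.allclose(jacobian, np.array([
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]))
# In practice the dense Jacobian is never materialized; backprop just
# multiplies the upstream gradient elementwise by the 0/1 mask.
```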

Aside: Why better gradient flow than sigmoid? (ChatGPT)

Rectified Linear Units (ReLU) and sigmoid activation functions behave differently in terms of gradient flow during the backpropagation process in neural networks, which can impact training efficiency. ReLU tends to provide better gradient flow than sigmoid for a few reasons:

  1. Vanishing Gradient Problem: Sigmoid activation functions have a range between 0 and 1, which causes gradients to become very small when the input values are far from 0. This is particularly problematic during backpropagation because small gradients can lead to slow convergence or even complete stagnation of learning. ReLU, on the other hand, has a constant gradient of 1 for positive inputs, which can help mitigate the vanishing gradient problem and allow for faster learning.

  2. Non-linearity and Sparsity: While sigmoid provides non-linearity to the network, ReLU introduces a sparsity aspect. This is because ReLU outputs 0 for negative inputs, effectively deactivating the neuron. This sparsity can make the network more efficient by reducing the number of active neurons and simplifying the representation of data, which can enhance gradient flow through the network.

  3. Efficient Computation: ReLU is computationally more efficient than sigmoid. The sigmoid function requires exponentiation and division operations, which can be more costly in terms of computation compared to the simple thresholding operation of ReLU. This computational efficiency can contribute to faster training times.

  4. Avoiding Saturation: Sigmoid saturates to either 0 or 1 when its inputs are very large or very small, causing the gradients to be close to zero. In such cases, the network's weights don't update effectively, slowing down learning. ReLU does not saturate for positive inputs, allowing gradients to flow more effectively through the network.

  5. Initialization: Initialization techniques, like He initialization, have been specifically designed for ReLU activations. These initialization methods help prevent gradients from becoming too small during the early stages of training, promoting better gradient flow.

However, it's important to note that ReLU isn't without its drawbacks. It can suffer from a problem known as the "dying ReLU" problem, where a large portion of the neurons can become inactive and never recover during training. This issue has led to the development of variants like Leaky ReLU, Parametric ReLU, and Exponential Linear Units (ELU) to address the dying ReLU problem while retaining the benefits of improved gradient flow.

In summary, ReLU generally provides better gradient flow compared to sigmoid due to its sparsity, non-saturation, computational efficiency, and avoidance of the vanishing gradient problem. These factors collectively contribute to faster and more effective training of neural networks.

Lesson 3: Optimization of Deep Neural Networks
